Clamp idle runner sleep to min(base, 30s) #354

Merged: daniel-thom merged 2 commits into main from runner-idle-poll-backoff on May 27, 2026

Conversation

@daniel-thom
Collaborator

A runner whose claim returns no work and whose last child has exited has nothing to react to except a timer wake-up: SIGCHLD never fires when running_jobs is empty. Letting the existing claim_backoff_max_secs ramp take over in that state delays workflow-complete, idle-exit, and end_time detection by up to the configured cap for no benefit.

When the runner is idle, the sleep now clamps to min(job_completion_poll_interval, IDLE_BACKOFF_CAP_SECS), where IDLE_BACKOFF_CAP_SECS is a hard-coded 30s. This keeps closing-case detection responsive even when a long base interval is configured for cost reasons, while still honoring the user's preferred cadence when it is tighter than 30s. The busy-at-capacity ramp is unchanged.
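A minimal sketch of the idle clamp described above. The constant name and the 30s value come from this description, and `idle_poll_interval` is named in the review summary; the signature and the use of `f64` seconds are assumptions made for illustration, not the merged code.

```rust
/// Hard cap on the idle wait, in seconds (30s per the PR description).
const IDLE_BACKOFF_CAP_SECS: f64 = 30.0;

/// Assumed signature: returns the wait to use when no child processes
/// are tracked, given the configured `job_completion_poll_interval`.
fn idle_poll_interval(base_secs: f64) -> f64 {
    // The smaller of the two wins: a base tighter than 30s is honored,
    // and a base longer than 30s is clamped down to the cap.
    base_secs.min(IDLE_BACKOFF_CAP_SECS)
}
```

Under these assumptions, a base of 5s yields a 5s idle wait, while a base of 120s yields 30s.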

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

This PR adjusts the job runner’s adaptive backoff behavior so that when the runner is truly idle (no tracked child processes), its sleep interval is clamped to min(job_completion_poll_interval, 30s), keeping workflow-completion / idle-exit / end-time detection responsive even when a long base poll interval is configured.

Changes:

  • Add an explicit “idle (no children)” regime with a hard cap of 30s via idle_poll_interval and an is_idle flag passed into next_poll_interval.
  • Update the main loop’s wait selection to use the idle clamp when running_jobs is empty, while leaving the busy-at-capacity ramp behavior unchanged.
  • Expand unit tests and update documentation to describe the three adaptive-backoff regimes.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| ---- | ----------- |
| src/client/job_runner.rs | Introduces idle clamp helper/constant, threads is_idle through backoff computation, updates wait logic and tests. |
| docs/src/core/concepts/job-runners.md | Documents adaptive backoff with separate busy vs. idle regimes and rationale/examples. |

Comment thread src/client/job_runner.rs Outdated
Comment thread docs/src/core/concepts/job-runners.md Outdated
The prior phrasing said the idle wait is "never faster than base", which
is wrong when base > 30s — in that case `min(base, 30)` deliberately
polls faster than base, which is the whole point. Restate the two
guarantees the formula actually provides: the wait is at most 30s, and
never longer than the configured base.

Code behavior is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comment thread src/client/job_runner.rs
/// toward `cap`. The cap is clamped to at least `base` so callers cannot
/// accidentally shrink the wait below the configured floor.
fn next_poll_interval(current: f64, base: f64, cap: f64, made_progress: bool) -> f64 {
/// In the non-idle path, progress resets the wait to `base`; an idle
Comment thread src/client/job_runner.rs
Comment on lines 3715 to 3721
#[test]
fn next_poll_interval_doubles_on_idle() {
// Empty iterations grow the wait by a factor of two until the cap.
let base = 30.0;
let cap = 300.0;
let mut current = base;
let steps = [60.0, 120.0, 240.0, 300.0, 300.0];
Comment thread src/client/job_runner.rs
Comment on lines 3766 to 3772
#[test]
fn next_poll_interval_idle_never_decreases() {
// An idle step from current=base must never return less than base.
let base = 30.0;
let cap = 300.0;
-let next = next_poll_interval(base, base, cap, false);
+let next = next_poll_interval(base, base, cap, false, false);
assert!(next >= base);
Comment on lines +248 to +258
| State | Wait |
| ----------------------------- | ---------------------------------------- |
| Making progress | `job_completion_poll_interval` (base) |
| Busy at capacity, no progress | doubles toward `claim_backoff_max_secs` |
| Idle (no children to reap) | `min(job_completion_poll_interval, 30s)` |

**Busy-at-capacity case (long-running workflows).** When the runner is fully loaded and nothing is
completing or being claimed, polling at the base interval would generate unnecessary requests for
hours. Each iteration with no progress doubles the wait, capped at `claim_backoff_max_secs` (default
300s). The wait resets to base immediately on any progress: a local completion, a successful claim,
or a `SIGCHLD` wake-up.
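The three regimes in the table can be sketched as one function. This is an illustration under stated assumptions, not the code in `src/client/job_runner.rs`: the five-argument shape follows the updated test call in this thread, but the parameter order of the two booleans, the precedence of the idle check over progress, and the `f64` seconds representation are assumed.

```rust
/// Hard cap on the idle wait, in seconds (30s per the PR description).
const IDLE_BACKOFF_CAP_SECS: f64 = 30.0;

/// Sketch of the wait selection. Argument order for `made_progress`
/// and `is_idle` is an assumption.
fn next_poll_interval(
    current: f64,
    base: f64,
    cap: f64,
    made_progress: bool,
    is_idle: bool,
) -> f64 {
    if is_idle {
        // Idle (no children to reap): min(base, 30s).
        return base.min(IDLE_BACKOFF_CAP_SECS);
    }
    if made_progress {
        // Any progress resets the wait to the base interval.
        return base;
    }
    // No progress with children still running: double toward the cap.
    // The cap is raised to at least `base` so a small cap cannot shrink
    // the wait below the configured floor.
    let cap = cap.max(base);
    (current * 2.0).min(cap)
}
```

With a base of 30s and a cap of 300s, successive no-progress iterations in this sketch give 60, 120, 240, 300, 300, matching the steps listed in the unit test quoted above.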
@daniel-thom daniel-thom merged commit acb0025 into main May 27, 2026
10 checks passed
@daniel-thom daniel-thom deleted the runner-idle-poll-backoff branch May 27, 2026 22:06
daniel-thom added a commit that referenced this pull request May 27, 2026
PR #354 introduced an explicit is_idle code concept ("no tracked
children") but left the surrounding docs and test names using "idle" in
the loose sense of "iteration with no progress". After the change those
two meanings collide.

- Reword next_poll_interval's doc comment to say "no-progress iteration"
  in the non-idle branch and explicitly note that no-progress does not
  imply is_idle.
- Rename next_poll_interval_doubles_on_idle ->
  next_poll_interval_doubles_on_no_progress and
  next_poll_interval_idle_never_decreases ->
  next_poll_interval_no_progress_never_decreases; both already exercise
  is_idle=false, so the new names reflect what they actually cover.
- In docs, replace the "Busy at capacity, no progress" table row and
  prose with "No progress (children still running)". The ramp engages
  on any no-progress iteration with running children, not just at
  capacity (e.g., spare slots but unmet dependencies → server returns
  no work → still ramps).

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>